Preparing for Production
Optional
Reproducibility
Explainability
Scaling
{tictoc} and {profvis}
Read the LIME paper, which we will discuss during the live session.
Work through the understanding LIME R tutorial.
Use code profiling tools to assess the performance of your rolling_mean() and rolling_sd() functions. Identify any efficiency gains that could be made.
Simulate a HPP \(\{X_1, \ldots, X_N\}\) on \([0,T]\) in two ways:
First use \(X_{i+1} - X_{i} \sim \text{Exp}(\lambda)\).
Second use \(N \sim \text{Pois}(\lambda)\) and \(X_i | N = n \overset{\mathrm{i.i.d}}{\sim} \text{Unif}(0, T)\).
Evaluate and compare the reproducibility and scalability of each implementation.
rolling_mean() and rolling_sd() functions?
rolling_mean()
Calculate the weekly mean on 1 year of data:
Error in parse_rprof_lines(lines, expr_source) :
No parsing data available. Maybe your function was too fast?
rolling_mean()
Solutions:
Use {microbenchmark} if you are really interested in instances of this size.
Drop and append to reduce copying
Use data structure / language that allows modification in place
Switch to “Online” rolling mean estimate
Any other ideas not already mentioned?
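As a sketch of the drop-and-append and "online" ideas combined: once the window mean is known, the next window's mean needs only the value leaving and the value entering, rather than a fresh sum. The function name and signature below are my own, not from the course materials.

```r
# Hypothetical sketch: update a rolling mean of window width k incrementally.
# current_mean: mean of the current window
# x_out: value dropping out of the window; x_in: value entering it
update_rolling_mean <- function(current_mean, x_out, x_in, k) {
  current_mean + (x_in - x_out) / k
}

# Usage: width-3 window sliding from c(1, 2, 3) to c(2, 3, 4)
m1 <- mean(c(1, 2, 3))
m2 <- update_rolling_mean(m1, x_out = 1, x_in = 4, k = 3)
```

Each update is O(1) regardless of window width, at the cost of slow accumulation of floating-point error over very long series.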
Compare the time it takes to sum the rows of a matrix using a for loop versus using the built-in function rowSums().
# For loop
system.time({
  row_sums_X <- rep(NA, nrow(X))
  for (i in seq_along(row_sums_X)) {
    row_sums_X[i] <- sum(X[i, ])
  }
})
   user  system elapsed
  0.803   0.012   0.823
For Loop:
user system elapsed
0.880 0.014 0.915
Built-in function from {Matrix}:
user system elapsed
0.010 0.002 0.011
Task 1
The output is three values: 'user', 'system' and 'elapsed'. Using the documentation for system.time(), investigate what these mean.
Extension: What happens if we consider a “wide” or “square-ish” matrix, rather than a “tall” matrix?
“system.time() calls the function proc.time(), evaluates expr, and then calls proc.time() once more, returning the difference between the two proc.time() calls.” - system.time {base}
“proc.time() returns five elements for backwards compatibility, but its print method prints a named vector of length 3. The first two entries are the total user and system CPU times of the current R process and any child processes on which it has waited, and the third entry is the ‘real’ elapsed time since the process was started.” - proc.time {base}
Note:
“Timings of evaluations of the same expression can vary considerably depending on whether the evaluation triggers a garbage collection. When gcFirst = TRUE a garbage collection (gc) will be performed immediately before the evaluation of expr. This will usually produce more consistent timings.” - system.time {base}
Investigate whether using X[i, 1] + X[i, 2] rather than sum(X[i, ]) significantly alters any of these values.
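One minimal way to run this comparison, assuming a tall two-column matrix X (the matrix here is generated for illustration, not the one from the earlier task):

```r
# Assumed setup: a tall two-column matrix of standard normals
set.seed(1)
X <- matrix(rnorm(2e5), ncol = 2)

# Generic approach: a function call plus a row subset on each iteration
system.time({
  out <- rep(NA_real_, nrow(X))
  for (i in seq_len(nrow(X))) {
    out[i] <- sum(X[i, ])
  }
})

# Specialised approach: two element look-ups and an addition
system.time({
  out <- rep(NA_real_, nrow(X))
  for (i in seq_len(nrow(X))) {
    out[i] <- X[i, 1] + X[i, 2]
  }
})
```

Both loops produce the same row sums; any difference in the timings comes from the overhead of the extra function call and row subset.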
In this case, calling the sum() function is less efficient than doing two look-ups and an addition within R.
This is interesting because it goes against the standard advice that vectorised or compiled code will generally be faster.
Below are three ways to create a vector of Gaussian random variates. Measure how long each approach takes and explain the ordering that you find.
Let’s simplify things by defining each of these as a function.
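The three approaches themselves are not reproduced here, so the following is a plausible reconstruction wrapped as functions; the function names are my own.

```r
# Approach 1: grow the vector inside the loop (triggers a copy each iteration)
gaussian_grow <- function(n) {
  x <- c()
  for (i in seq_len(n)) {
    x <- c(x, rnorm(1))
  }
  x
}

# Approach 2: preallocate storage, then fill it in place
gaussian_preallocate <- function(n) {
  x <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    x[i] <- rnorm(1)
  }
  x
}

# Approach 3: a single vectorised call (the loop happens in compiled code)
gaussian_vectorised <- function(n) {
  rnorm(n)
}
```

With the same seed, all three return identical values; we would expect the vectorised version to be fastest and the growing version slowest.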
Preallocation improves over growing the vector because it allows modification in place. For a detailed discussion of R’s copy on modify behaviour see Advanced R Chapter 2.
The vectorised code essentially moves the for loop to C. This is not only a faster language but also reduces the number of calls that have to be made to C functions.
The following code will simulate a homogeneous Poisson process on the interval \([0,t_{\text{max}}]\) with rate \(\lambda = 0.2\).
Each event has a mark associated with it, drawn from a shifted exponential distribution.
t_max <- 100
lambda <- 0.2

event_times <- c()
event_marks <- c()
t <- 0

time_to_first_event <- rexp(n = 1, rate = lambda)
t <- t + time_to_first_event

while (t < t_max) {
  event_times <- c(event_times, t)
  event_marks <- c(event_marks, 3 + rexp(n = 1, rate = 1))
  time_to_next_event <- rexp(n = 1, rate = lambda)
  t <- t + time_to_next_event
}

point_pattern <- data.frame(event_times, event_marks)

Task 4.1: Convert this code to a function.
sim_hpp <- function(lambda = 0.2, t_max = 100, offset = 3) {
  event_times <- c()
  event_marks <- c()
  t <- 0
  time_to_first_event <- rexp(n = 1, rate = lambda)
  t <- t + time_to_first_event
  while (t < t_max) {
    event_times <- c(event_times, t)
    event_marks <- c(event_marks, offset + rexp(n = 1, rate = 1))
    time_to_next_event <- rexp(n = 1, rate = lambda)
    t <- t + time_to_next_event
  }
  point_pattern <- data.frame(time = event_times, mark = event_marks)
  # return the data.frame so that it can be stored, but do not print to console
  invisible(point_pattern)
}

Task 4.2: Create a plot showing how the runtime of this code changes as \(t_{\text{max}}\) increases.
Time the code:
library(tictoc)

t_max_values <- seq(5000, 50000, by = 1000)
run_times <- rep(NA, length(t_max_values))
for (run in seq_along(t_max_values)) {
  tic(quiet = TRUE)
  sim_hpp(t_max = t_max_values[run])
  timer <- toc(quiet = TRUE)
  run_times[run] <- timer$toc - timer$tic
  # print(paste("Completed run with t_max = ", t_max_values[run]))
}

Create the plot:
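The plot itself is not reproduced here; a base-graphics sketch follows. The run_times values below are placeholders standing in for the timings computed above, so that the snippet runs on its own.

```r
# Placeholder inputs so the snippet is self-contained;
# in the exercise these come from the timing loop above.
t_max_values <- seq(5000, 50000, by = 1000)
run_times <- (t_max_values / 50000)^2  # dummy values, not real timings

plot(
  t_max_values, run_times,
  type = "b", pch = 16,
  xlab = "t_max",
  ylab = "Run time (seconds)",
  main = "Runtime of sim_hpp() as t_max increases"
)
```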
Simulate an earthquake catalogue on [0, 100000] with rate \(\lambda= 0.2\).
Use profvis::profvis() to identify bottlenecks in your code.
Note: We need to remember to source the function from a separate script so that we get a line-by-line breakdown.
Rewrite the code to be more efficient and create a second plot comparing the runtime of your new function to the original.
Solving this bottleneck is complicated by the fact that we do not know how many observations we will have before running the code.
We know that our total number of events \(N \sim \text{Poisson}(\lambda t_{\text{max}})\), which means we can (with very low probability) have an arbitrarily large number of events to simulate.
One solution is to pre-allocate storage that needs to be extended on only 1 in 1000 runs of the function and which further extends that storage as required.
sim_hpp_preallocated <- function(lambda = 0.2, t_max = 100, offset = 3) {
  initial_storage_size <- qpois(p = 0.999, lambda = lambda * t_max)
  event_times <- rep(NA, initial_storage_size)
  event_marks <- rep(NA, initial_storage_size)
  t <- 0
  time_to_first_event <- rexp(n = 1, rate = lambda)
  t <- t + time_to_first_event
  event_count <- 0
  while (t < t_max) {
    # increment event counter and record details of this event
    event_count <- event_count + 1
    event_times[event_count] <- t
    event_marks[event_count] <- offset + rexp(n = 1, rate = 1)
    # calculate time of next event
    time_to_next_event <- rexp(n = 1, rate = lambda)
    t <- t + time_to_next_event
  }
  hpp <- data.frame(event_times, event_marks)
  # keep only rows that have been filled in
  hpp[!is.na(hpp$event_times), ]
}

We can then compare the time it takes to run each function as \(t_{\text{max}}\) increases.
Create the plot of runtimes
Often using different data structures or simulation techniques is much more effective than trying to wring every last drop of efficiency from your current implementation.
Task 4.5: Compare this to an implementation that uses the Poisson distribution of the total event count to first simulate the number of events, and then randomly allocates their locations over the interval.
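One plausible sketch of such a conditional implementation follows. The name sim_hpp_conditional() matches the function called in the timing code later in this section, but the body here is my reconstruction; the offset mark distribution and column names are assumed to match the earlier functions.

```r
sim_hpp_conditional <- function(lambda = 0.2, t_max = 100, offset = 3) {
  # Total event count: N ~ Poisson(lambda * t_max)
  n_events <- rpois(n = 1, lambda = lambda * t_max)
  # Conditional on N = n, event times are i.i.d. Uniform(0, t_max);
  # sort them so the output is in temporal order
  event_times <- sort(runif(n = n_events, min = 0, max = t_max))
  # Shifted exponential marks, drawn in a single vectorised call
  event_marks <- offset + rexp(n = n_events, rate = 1)
  invisible(data.frame(time = event_times, mark = event_marks))
}
```

This replaces the while loop and its one-at-a-time random draws with three vectorised calls, which is why it scales so favourably in \(t_{\text{max}}\).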
We can compare runtime in a similar way, and find that this scales even more favourably than our previous approach.
run_times_3 <- rep(NA, length(t_max_values))
for (run in seq_along(t_max_values)) {
  set.seed(1234)
  tic(quiet = TRUE)
  sim_3 <- sim_hpp_conditional(t_max = t_max_values[run])
  timer_3 <- toc(quiet = TRUE)
  # print(paste("Completed run with t_max = ", t_max_values[run]))
  run_times_3[run] <- timer_3$toc - timer_3$tic
}

Note: It is no longer possible to ensure that the same number of points are generated in each iteration, because we are not generating random variates in the same order. We might want to further investigate whether this improvement is stable across many realisations.
Question: What is going on with \(t_{\text{max}} = 5000\)?
Profiling is a useful tool for identifying code bottlenecks that become large problems in production.
Tips
Effective Data Science: Production - Live Session - Zak Varty